Topical and Lexical Similarity
Much criticism of Donald Trump has centered on his being unfit for the Office of President of the United States, or his “unpresidentiality”.
Some of this criticism stems from his rhetorical style, which has likewise been deemed “unpresidential”.
But what does “Presidentiality” mean? Are there traits, character qualities, rhetorical styles, or other elements common to US Presidents?
Have US Presidents throughout history given speeches and official addresses that resemble one another?
What topics have they most commonly spoken on, and how have those topics changed over time?
The Miller Center at the University of Virginia’s ‘Presidential Speeches’ collection provides speeches from George Washington to today in text format
Most speeches are official addresses, remarks, or statements
Available to the public as a free JSON download
Collection is not exhaustive, but it is extensive, containing over 1,000 speeches
Temporal shifts in American society; realignments in American domestic and foreign policy
Only Presidents from the 20th century onward, beginning with Theodore Roosevelt (1901)
Only speeches given while in office (no campaign or other speeches), for consistency
Historical trends in topics still relevant today, excluding archaic topics (slavery, railroads, etc.)
Minimal cleaning, to maintain semantic and contextual coherence for my models
BERTopic groups similar speeches by their language patterns, identifying key themes automatically with advanced language models
Looks at context of words in relation to each other to find and build common topics
Measures how similar speeches are by comparing them in vector space
Uses word embeddings to capture semantic meaning, not just keywords
Helps identify subtle language patterns and thematic connections
TF-IDF compares speeches based on word frequency adjusted by overall rarity
Effective for spotting shared vocabulary across texts
Less suited than the embedding strategy for capturing semantic meaning (see the sketch below)
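The trade-off is easy to see in a toy comparison. The following sketch is illustrative only and not part of the original analysis; the two sentences and the all-MiniLM-L6-v2 model are assumptions chosen for the example:

# Illustrative contrast between the two similarity strategies on two paraphrases
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

texts = ["We must defend liberty abroad.",
         "Freedom overseas has to be protected."]

# TF-IDF: the sentences share no tokens, so similarity is exactly zero
tfidf = TfidfVectorizer().fit_transform(texts)
print(cosine_similarity(tfidf)[0, 1])

# Embeddings: shared meaning is captured despite the different vocabulary
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts)
print(cosine_similarity(emb)[0, 1])

The TF-IDF score is zero here because the sentences have no words in common, while the embedding score is typically much higher because the underlying meaning overlaps.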
Firstly, the proportion of topic prevalence in Presidential speeches is time dependent: it fluctuates with changes in the global or domestic political spheres.
Secondly, in both content and vocabulary, there is a similarity among Presidents from Coolidge through Clinton, after which there is a break and a new set of similarities begins.
Thirdly and lastly, while Donald Trump’s rhetorical choices have been criticized as the great break from previous Presidents, this is verifiable only in terms of topic/thematic consistency, not vocabulary.
This technical appendix documents the operations performed to create this memo.
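The snippets below assume the corpus has already been loaded into a pandas DataFrame named df. A minimal loading sketch, assuming the Miller Center JSON was saved locally after the minimal cleaning described above (the file name speeches.json is an assumption):

import pandas as pd

# Hypothetical path; assumes one record per speech with columns such as
# president, date, speech_id, and cleaned_text
df = pd.read_json("speeches.json")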
# Keep only Presidents from the 20th century onward and drop all others
keep_presidents = [
"Theodore Roosevelt", "William Taft", "Woodrow Wilson",
"Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
"Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
"John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
"Gerald Ford", "Jimmy Carter", "Ronald Reagan",
"George H. W. Bush", "Bill Clinton", "George W. Bush",
"Barack Obama", "Donald Trump", "Joe Biden"
]
df_new = df[df["president"].isin(keep_presidents)]
df_new = df_new.drop(['doc_name', 'title'], axis=1)
# Chronological List
df_new.sort_values("date", inplace=True)
df_new.reset_index(drop=True, inplace=True)
# Define the dates each President came to office and left; each value is a
# list of (start, end) terms so non-consecutive terms are handled correctly
president_terms = {
    "Theodore Roosevelt": [("1901-09-14", "1909-03-04")],
    "William Taft": [("1909-03-04", "1913-03-04")],
    "Woodrow Wilson": [("1913-03-04", "1921-03-04")],
    "Warren G. Harding": [("1921-03-04", "1923-08-02")],
    "Calvin Coolidge": [("1923-08-02", "1929-03-04")],
    "Herbert Hoover": [("1929-03-04", "1933-03-04")],
    "Franklin D. Roosevelt": [("1933-03-04", "1945-04-12")],
    "Harry S. Truman": [("1945-04-12", "1953-01-20")],
    "Dwight D. Eisenhower": [("1953-01-20", "1961-01-20")],
    "John F. Kennedy": [("1961-01-20", "1963-11-22")],
    "Lyndon B. Johnson": [("1963-11-22", "1969-01-20")],
    "Richard M. Nixon": [("1969-01-20", "1974-08-09")],
    "Gerald Ford": [("1974-08-09", "1977-01-20")],
    "Jimmy Carter": [("1977-01-20", "1981-01-20")],
    "Ronald Reagan": [("1981-01-20", "1989-01-20")],
    "George H. W. Bush": [("1989-01-20", "1993-01-20")],
    "Bill Clinton": [("1993-01-20", "2001-01-20")],
    "George W. Bush": [("2001-01-20", "2009-01-20")],
    "Barack Obama": [("2009-01-20", "2017-01-20")],
    # Trump served two non-consecutive terms; the second entry ends at the
    # data-collection cutoff
    "Donald Trump": [("2017-01-20", "2021-01-20"), ("2025-01-20", "2025-04-27")],
    "Joe Biden": [("2021-01-20", "2025-01-20")],
}
# Drop speeches given when a President was not actively in office, ensuring only presidential speeches remain in our data
def was_president_at_time(row):
    pres = row['president']
    date = row['date']
    for start, end in president_terms.get(pres, []):
        if start <= date <= end:
            return True
    return False
df_proper = df_new[df_new.apply(was_president_at_time, axis=1)].reset_index(drop=True)
# Create chunks of 300 words from the speeches
def chunk_text(text, max_words=300):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]
df_proper['chunks'] = df_proper['cleaned_text'].apply(chunk_text)
# Create separate dataframe of all chunks flattened to analyze
docs_chunked = [chunk for chunks in df_proper['chunks'] for chunk in chunks]
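As a quick sanity check (illustrative only, not part of the original pipeline), each speech should yield one or more 300-word chunks, so the chunk count should exceed the speech count:

# Each speech produces ceil(word_count / 300) chunks
print(f"{len(df_proper)} speeches -> {len(docs_chunked)} chunks")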
# Initialize the BERTopic model and apply it to the list of chunks
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import numpy as np
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(random_state=42)
#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(
    #embedding_model=embedding_model,
    #vectorizer_model=vectorizer_model,  # additional models that could be used included here
    calculate_probabilities=True,
    verbose=False,
    umap_model=umap_model,
    #top_n_words=7,
    #nr_topics="auto",
)
topics, probs = topic_model.fit_transform(docs_chunked)
# Get information on the topics identified by BERTopic, counts for each topic, associated keywords
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]
topic_info_simple = topic_info[["Topic", 'Count', "Representation"]]
# Deduplicate the keywords and join them into a single string for better visibility
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
frequency_table = pd.DataFrame(topic_info_simple)
# Create a new DataFrame of just chunked text with the requisite column names, fashioned just like our old DF
chunked_data = []
for _, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk
        })
df_chunked = pd.DataFrame(chunked_data)
# Remove Topic -1 (the catch-all topic with keywords such as 'the', 'of', 'a', etc.)
# and reassign those chunks to the next most probable topic
## Ensures that speech chunks all have the best associated topic rather than this catch-all
excluded_topics = [-1]

def reassign_topic(topic, prob_row):
    if topic in excluded_topics:
        # Walk topics from most to least probable and take the first allowed one
        sorted_indices = np.argsort(prob_row)[::-1]
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic
    else:
        return topic

df_chunked["topic"] = [
    reassign_topic(t, p) for t, p in zip(topics, probs)
]
# Map topic labels next to each topic number for better visualization
# Attach the year a speech was given to each row
topic_labels = {
row["Topic"]: row["Representation"]
for _, row in topic_info_simple.iterrows()
}
df_chunked["topic_label"] = df_chunked["topic"].map(topic_labels)
df_chunked['year'] = pd.to_datetime(df_chunked['date']).dt.year
# Keep only the five most frequent topics, then count chunks per topic by year
top_5_topics = df_chunked['topic'].value_counts().head(5).index
df_top_5 = df_chunked[df_chunked['topic'].isin(top_5_topics)]
df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')
df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')
# Compute the proportion of chunks in each topic relative to all chunks that year
## Accounts for years with fewer or more speeches
df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)
# Create summary labels for each of the following topics for easier visualization on a graph
manual_labels = {
0: "Vietnam War",
1: "Health Care",
7: "Banks, Credit, Gold",
2: "Peace, Nations, War",
3: "Rights, Blacks, White",
}
df_count_by_year['manual_labels'] = df_count_by_year["topic"].map(manual_labels)
# Create an area graph for the 5 topics together
library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year
ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
geom_area() +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
fill = "Topic") +
scale_fill_viridis_d() +
theme(plot.title = element_text(size = 12, face = 'bold'),
legend.position = "bottom",
legend.text = element_text(size = 6))
# Create a line graph for each topic by itself
ggplot(count_by_year, aes(x = year, y = proportion)) +
geom_line(aes(color = manual_labels)) +
facet_wrap(~ manual_labels, scales = "free_y") +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
color = "Topic") +
scale_color_viridis_d() +
theme(
legend.position = "none",
plot.title = element_text(size = 10, face = 'bold')
)
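The heatmap below reads similarity_df from Python, but the step that computes it does not appear in the code above. A minimal sketch, assuming chunk-level sentence embeddings (the all-MiniLM-L6-v2 model is an assumption) averaged per President and compared with cosine similarity:

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Embed each 300-word chunk, then average the chunk embeddings per President
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(df_chunked['transcript'].tolist())
by_president = (
    pd.DataFrame(chunk_embeddings, index=df_chunked['president'])
    .groupby(level=0).mean()
)

# Pairwise cosine similarity between the Presidents' mean embeddings
similarity_df = pd.DataFrame(
    cosine_similarity(by_president.values),
    index=by_president.index,
    columns=by_president.index,
)

Averaging chunk embeddings keeps each President's vector within the encoder's token window, since whole speeches far exceed what the model can embed at once.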
# Plot a heatmap using the viridis color scale for easier visualization
library(reshape2)
library(viridis)
similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))
ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
geom_tile(color = "white", linewidth = 0.3) +
scale_fill_viridis(
option = "viridis", # Try "magma", "plasma", or "inferno" for other easily visualizable variants
direction = -1,
limits = c(min(melted_matrix$value), max(melted_matrix$value))
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
axis.text.y = element_text(size = 10),
legend.position = "right",
plot.title = element_text(size = 10, face = 'bold')
) +
labs(
x = "President",
y = "President",
title = "Word Embedding - Cosine Similarity of Presidential Speeches",
fill = "Similarity"
) +
coord_fixed()
# Import vectorizer and list of stopwords to clean our text
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Define a function to remove stopwords from a text
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)
# Join all of each President's text into a single string
presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()
# Cleans the text by removing stopwords
presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)
# Vectorize the text with TF-IDF and save the document-feature matrix separately
vectorizer = TfidfVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])
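Likewise, similarity_df_2, which the next R chunk reads, is not computed in the code shown. A minimal sketch, assuming cosine similarity over the TF-IDF matrix just built:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the Presidents' TF-IDF vectors
similarity_df_2 = pd.DataFrame(
    cosine_similarity(president_dfm),
    index=presidents_aggregated['president'],
    columns=presidents_aggregated['president'],
)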
# Creates another heatmap in the same style as the previous one, using viridis for ease of visualization
similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))
ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
geom_tile(color = "white", linewidth = 0.3) +
scale_fill_viridis(
option = "viridis", # Try "magma", "plasma", or "inferno" for other easily visualizable variants
direction = -1,
limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value))
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
axis.text.y = element_text(size = 10),
legend.position = "right",
plot.title = element_text(size = 10, face='bold')
) +
labs(
x = "President",
y = "President",
title = "TF-IDF Cosine Similarity of Presidential Speeches",
fill = "Similarity"
) +
coord_fixed()